MDL-based DCG Induction for NP Identification
Abstract
We introduce a learner capable of automatically extending large, manually written natural language Definite Clause Grammars with missing syntactic rules. It is based upon the Minimum Description Length principle, and can be trained upon either just raw text, or else raw text additionally annotated with parsed corpora. As a demonstration of the learner, we show how full Noun Phrases (NPs that might contain pre- or post-modifying phrases and might also be recursively nested) can be identified in raw text. Preliminary results obtained by varying the amount of syntactic information in the training set suggest that raw text is less useful than additional NP bracketing information. However, using all syntactic information in the training set does not produce a significant improvement over bracketing information alone.

1 Introduction

Identification of Noun Phrases (NPs) in free text has been tackled in a number of ways (for example, [25, 9, 2]). Usually, however, only relatively simple NPs, such as 'base' NPs (NPs that do not contain nested NPs or postmodifying clauses), are recovered. The motivation for this decision seems to be pragmatic, driven in part by a lack of technology capable of parsing large quantities of free text. With the advent of broad-coverage grammars (for example [15]) and attendant efficient parsers (for example [11]), however, we need not make this restriction: we can now identify 'full' NPs, NPs that might contain pre- and/or post-modifying complements, in free text. Full NPs are more interesting than base NPs to estimate:

• They are (at least) context free, unlike base NPs, which are finite state. They can contain pre- and post-modifying phrases, and so proper identification can in the worst case imply full-scale parsing/grammar learning.

• Recursive nesting of NPs means that each nominal head needs to be associated with each NP. Base NPs simply group all potential heads together in a flat structure.
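The recursive nesting point can be made concrete with a small sketch. The function below extracts every NP span, including nested ones, from a simplified bracketed string; the bracketing format and the function itself are illustrative assumptions, not the paper's actual DCG machinery.

```python
def extract_nps(bracketed):
    """Return all NP spans (including nested ones) from a simple
    bracketed string such as "[NP the cat [PP on [NP the mat]]]".
    Purely illustrative: a real system would parse free text with
    a DCG rather than read off pre-existing brackets.
    """
    spans = []
    stack = []  # start index of each open constituent; None if not an NP
    i = 0
    while i < len(bracketed):
        if bracketed.startswith("[NP", i):
            stack.append(i)
            i += 3
        elif bracketed[i] == "[":
            stack.append(None)  # some non-NP constituent, e.g. PP
            i += 1
        elif bracketed[i] == "]":
            start = stack.pop()
            if start is not None:
                spans.append(bracketed[start:i + 1])
            i += 1
        else:
            i += 1
    return spans
```

Running it on "[NP the cat [PP on [NP the mat]]]" yields two NPs, one nested inside the other, whereas a base-NP chunker would return only flat, non-nested spans.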
As a (partial) response to these challenges, we identify full NPs by treating the task as a special case of full-scale sentential Definite Clause Grammar (DCG) learning. Our approach is based upon the Minimum Description Length (MDL) principle. Here, we do not explain MDL, but instead refer the reader to the literature (for example, see [26, 27, 29, 12, 22]). Although a DCG learning approach to NP identification is far more computationally demanding than any other NP learning technique reported, it does provide a useful test-bed for exploring some of the (syntactic) factors involved …
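For readers unfamiliar with MDL, the core idea is a two-part code: the total description length of a candidate grammar is the cost of encoding the grammar itself plus the cost of encoding the training data given that grammar, and the learner prefers candidates that minimize the sum. The sketch below is a minimal illustration under assumed conventions (a fixed 32-symbol vocabulary costing 5 bits per symbol, and per-sentence costs supplied as negative log probabilities); none of these names or constants come from the paper.

```python
import math

def description_length(grammar_rules, sentence_costs):
    """Two-part MDL code length in bits.

    grammar_rules: list of dicts with a hypothetical "size" field
        (number of symbols in the rule). Encoding each symbol is
        assumed to cost log2(32) = 5 bits -- an illustrative choice.
    sentence_costs: -log2 probability of each training sentence
        under the grammar (the data cost).
    """
    model_bits = sum(rule["size"] * math.log2(32) for rule in grammar_rules)
    data_bits = sum(sentence_costs)
    return model_bits + data_bits
```

A larger grammar raises the model cost but may parse the data more probably, lowering the data cost; MDL-based induction accepts a new rule only when the net description length decreases.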
Similar papers
DCG Induction using MDL and Parsed
We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon estimation. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best r...
A Simple Transformation for Offline-Parsable Grammars and its Termination Properties
We present, in easily reproducible terms, a simple transformation for offline-parsable grammars which results in a provably terminating parsing program directly top-down interpretable in Prolog. The transformation consists of two steps: (1) removal of empty productions, followed by (2) left-recursion elimination. It is related both to left-corner parsing (where the grammar is compiled, rather tha...
Complexity of Model Checking for Modal Dependence Logic
Modal dependence logic (MDL) was introduced recently by Väänänen. It enhances the basic modal language by an operator =(·). For propositional variables p1, ..., pn the atomic formula =(p1, ..., pn−1, pn) intuitively states that the value of pn is determined solely by those of p1, ..., pn−1. We show that model checking for MDL formulae over Kripke structures is NP-complete and further co...
Properties of Bayesian Belief Network Learning Algorithms
In this paper the behavior of various belief network learning algorithms is studied. Selecting belief networks with certain minimality properties turns out to be NP-hard, which justifies the use of search heuristics. Search heuristics based on the Bayesian measure of Cooper and Herskovits and a minimum description length (MDL) measure are compared with respect to their properties for...